[UR][L0v2] Migrate discrete buffer through host when P2P is not accessible #22010
Please review @intel/unified-runtime-reviewers-level-zero
// Migrate buffer through the host: copy from the current device to a
// temporary host buffer, then from host to the target device.
auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);
nit: it may be worth considering a USM allocation instead of a heap allocation, as on line 100.
for (uint32_t i = 0; i < waitListView.num; i++) {
  ZE2UR_CALL_THROWS(zeEventHostSynchronize,
                    (waitListView.handles[i], UINT64_MAX));
}
I don't think this will work. The operation also needs to be ordered with respect to the command list itself, so something like this would be better:
if (numWaitEvents > 0) {
  ZE2UR_CALL(zeCommandListAppendWaitOnEvents,
             (zeCommandList.get(), numWaitEvents, pWaitEvents));
}
ZE2UR_CALL(zeCommandListHostSynchronize, (zeCommandList.get(), UINT64_MAX));
auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);

UR_CALL_THROWS(synchronousZeCopy(hContext, activeAllocationDevice,
I don't like the fact that this is synchronous. Can you explore what it would take to make it async? I think we'd need to keep the allocation somewhere.
Changed. Is it OK now?
@mateuszpn @pbalcer re-review please
Force-pushed from 9727548 to 1e9d552.
…sible

When a buffer on a discrete GPU needs to be accessed from a different device and P2P access is not enabled, migrate the data through a USM HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:
1. Synchronous device->host copy using the source device's own command list (the destination device cannot reach source device memory without P2P).
2. Async host->device copy enqueued on the caller's command list (host memory is accessible by all devices, so this is safe).

Before the device->host copy, any pending operations on the caller's command list are ordered and drained via zeCommandListAppendWaitOnEvents + zeCommandListHostSynchronize, ensuring prior kernel writes to the source buffer are visible. A fully synchronous fallback is used when no command list is available (e.g. urMemGetNativeHandle).

Only one staging buffer is kept alive at a time: it is released at the start of the next migration after zeCommandListHostSynchronize confirms the previous async copy has completed.

A new ensureDeviceAlloc helper allocates the destination device buffer without the activeAllocationDevice side-effect of allocateOnDevice, so the active-device state is only updated after the async copy is successfully enqueued.

Fixes: intel#22007
Fixes: intel#22008
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Add four conformance tests exercising discrete buffers accessed from two different device queues when P2P access is not available.

Tests covering the async migration path (cmdList != nullptr, triggered by urEnqueueMem* operations):
- AsyncFillThenReadOnSecondQueueWithWait: fills a buffer on queues[0] and reads it on queues[1] using an explicit event dependency.
- PingPongFillBetweenTwoDeviceQueues: alternates fills between queues[0] and queues[1], each read on the opposite queue using event dependencies.
- ChainedAsyncOpsAcrossQueuesWithEvents: chains fill, blocking write, and read across two queues using cross-queue events.

Test covering the synchronous fallback path (cmdList == nullptr, triggered by urMemGetNativeHandle):
- SyncFallbackMigrationViaNativeHandle: fills the buffer on device 0, calls urMemGetNativeHandle for device 1 to trigger synchronous host-staged migration, then verifies the data on device 1.

All tests add an explicit queues.size() < 2 guard (GTEST_SKIP) in case the fixture minimum-device requirement changes, and cross-queue ordering is expressed with events throughout to properly exercise the async migration path.

A dedicated L0 v2 adapter runner (discrete_buffer_host_migration.cpp) reuses the conformance test source under UR_LOADER_USE_LEVEL_ZERO_V2.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
The test was intermittently failing on CI hardware because the queue create + USM fill + urQueueFinish sequence before the memory measurement introduced a multi-millisecond time window. During that window, async driver cleanup from earlier P2P tests (which can fail to evict peer residency via zeContextEvictMemory) or concurrent GPU workloads on shared CI machines could change devices[1]'s GLOBAL_MEM_FREE reading enough to trigger the assertion.

The queue/fill/finish operations are not needed to test the residency property: zeContextMakeMemoryResident is invoked at urUSMDeviceAlloc time, so measuring immediately after the allocation captures any peer-residency side-effects without a blocking GPU operation in between. Remove those operations to keep the measurement window as short as possible, matching the pattern already used in allocationInitiallyAbsentOnPeer.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@mateuszpn @pbalcer re-review please
Do a sync after this command list.
With a sync after the final copy, the migration staging buffer isn't needed.
When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.
The migration uses a two-step copy:
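1. Synchronous device->host copy using the source device's own command
list (the destination device cannot reach source device memory
without P2P).
2. Async host->device copy enqueued on the caller's command list (host
memory is accessible by all devices, so this is safe).
Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents
+ zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible. A fully synchronous fallback is used when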
no command list is available (e.g. urMemGetNativeHandle).
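The drain-then-two-step-copy flow described above can be sketched with plain C++ stand-ins. This is an illustration only, not the adapter code: `MockDevice`, `MockCommandList`, and `migrateThroughHost` are hypothetical names, and the mock list simply defers work until the host synchronizes, which is assumed here to be enough to model ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a discrete device with local memory.
struct MockDevice {
  std::vector<char> memory;
};

// Hypothetical in-order command list: appended operations run only when
// the host synchronizes on the list.
struct MockCommandList {
  std::vector<std::function<void()>> pending;

  void append(std::function<void()> op) { pending.push_back(std::move(op)); }

  void hostSynchronize() {
    auto ops = std::move(pending);
    pending.clear();
    for (auto &op : ops)
      op();
  }
};

// Migrate src -> dst through a host staging buffer. The caller owns
// `staging` and must keep it alive until the async step has executed.
void migrateThroughHost(MockDevice &src, MockDevice &dst,
                        MockCommandList &callerList,
                        std::vector<char> &staging) {
  // Drain the caller's list so earlier writes to src become visible.
  callerList.hostSynchronize();

  // Step 1 (synchronous): device -> host.
  staging = src.memory;

  // Step 2 (asynchronous): host -> device, enqueued on the caller's list.
  dst.memory.assign(staging.size(), '\0');
  callerList.append([&dst, &staging] {
    std::copy(staging.begin(), staging.end(), dst.memory.begin());
  });
}
```

After `migrateThroughHost` returns, the staging buffer already holds the source data, but the destination is only filled once the caller's list is synchronized. That gap is why the staging buffer has to outlive the call.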
Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.
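The single-staging-buffer rule can be pictured as a small owner object. This is a minimal sketch under assumptions: `StagingSlot` and its members are hypothetical, and the `waitForPreviousCopy` callback stands in for the zeCommandListHostSynchronize call mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical owner of the one staging buffer that may be in flight.
class StagingSlot {
public:
  // Called at the start of every migration. If a buffer from the previous
  // migration is still held, wait for the copy that reads it, then free it.
  template <typename SyncFn>
  std::vector<char> &acquire(std::size_t size, SyncFn &&waitForPreviousCopy) {
    if (buffer_) {
      waitForPreviousCopy(); // the earlier async copy may still read buffer_
      buffer_.reset();
      ++released_;
    }
    buffer_ = std::make_unique<std::vector<char>>(size);
    return *buffer_;
  }

  bool hasBuffer() const { return buffer_ != nullptr; }
  std::size_t releasedCount() const { return released_; }

private:
  std::unique_ptr<std::vector<char>> buffer_;
  std::size_t released_ = 0;
};
```

The first `acquire` never waits, because there is nothing to release. Every later call waits exactly once before reusing the slot, so at most one staging buffer exists at any time.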
A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.
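The point of separating allocation from the active-device update can be shown with a mock. Everything here is an assumption made for illustration: `MockBuffer`, the integer device ids, and `migrateTo` are hypothetical, and only the name `ensureDeviceAlloc` is borrowed from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical buffer that tracks one allocation per device id.
struct MockBuffer {
  std::size_t size = 0;
  int activeDevice = -1; // device currently holding the valid copy
  std::map<int, std::vector<char>> allocations;

  // Allocate on `device` if needed, without touching activeDevice.
  std::vector<char> &ensureDeviceAlloc(int device) {
    auto it = allocations.find(device);
    if (it == allocations.end())
      it = allocations.emplace(device, std::vector<char>(size)).first;
    return it->second;
  }

  // Enqueue the copy first; commit the new active device only on success.
  template <typename EnqueueFn>
  void migrateTo(int device, EnqueueFn &&enqueueCopy) {
    std::vector<char> &dst = ensureDeviceAlloc(device);
    enqueueCopy(dst);      // may throw
    activeDevice = device; // reached only if the enqueue succeeded
  }
};
```

If the enqueue throws, the destination allocation exists but `activeDevice` still names the source, so a later access does not treat an unfilled allocation as current.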
Fixes: #22007
Fixes: #22008